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^ Abstract: It has recently been shown that an unbinned distance-based statistic, the energy, can 

^ be used to construct an extremely powerful nonparametric multivariate two sample goodness-of- 

• fit test. An extension to this method that makes it possible to perform nonparametric regression 

O using multiple multivariate data sets is presented in this paper. The technique, which is based 

^ on the concept of minimizing the energy of the system, permits determination of parameters of 

r~| interest without the need for parametric expressions of the parent distributions of the data sets. 

, ^ The application and performance of this new method is discussed in the context of some simple 

example analyses. 
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1. Introduction 

The concept of nuissance parameters, unknown parameters whose values are of no interest but must 
be determined so that estimators for the parameters of interest can be obtained, is a well known one 
in high energy physics. The name is apt in that the presence of such parameters increases the 
uncertainty on the parameters of interest but has little affect on how the analysis is performed. 
E.g., if the functional form of the probability density fuction (p.d.f.) is known, then the unkown 
parameters in the p.d.f. are typically determined using the least squares or maximum hkelihood 
methods. The presence of nuissance parameters simply increases the number of parameters whose 
values must be determined but does not affect how the values are obtained. 

The same cannot be said for nuissance distributions, i.e., distributions whose functional form 
is unknown and is of no interest but (seemingly) must be determined to obtain estimators for the 
parameters of interest. The most common solution to this problem in high energy physics is to 
obtain an estimate for such a p.d.f. either by fitting a model to the data or by binning the data to 
obtain the integral of the p.d.f. inside the bin. Both of these methods have their limitations: the 
(often unknown) systematic uncertainties in the model must be propagated back to the parameters, 
while binning the data results in information loss that tends to increase the statistical uncertainties. 
Another method that is used (albeit less frequently) in high energy physics is nonparametric kernel 
regression (see, e.g., Ref. [HU). This approach provides an unbinned data-driven way of obtaining 
an estimate for an unknown p.d.f.; however, its power and reliability is strongly tied to the quality 
of the p.d.f. estimate (which can be difficult to assess). 

All of these methods share the same basic underlying idea: one must obtain an estimate of the 
p.d.f. to obtain estimators for the parameters of interest. In this paper I will show that this is not the 
case. If one has obtained multiple (possibly multivariate) data sets whose p.d.f.'s are known to be 
related by some set of parameters, then the values of those parameters can be determined without 
the need for any estimates of the p.d.f.'s themselves; all that is required is the data obtained from 
the p.d.f.'s. This paper is laid out as follows: the method is presented in Section 0; some example 
applications are given in Section while a summary and discussion is provided in Section ^ 
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2. Method 

Consider the case where n^ data sets with completely unknown p.d.f.'s, denoted by f\ {x) . . .fnj{x), 
have been obtained (p.d.f.'s are normalized such that J f{x)dx = 1). Furthermore, data has also 
been taken that is known to have a p.d.f. that can be written as follows: 

/(^) = LA-/;-(^), £a- = i, (2.1) 

i i 

where j3, are the unknown parameters of interest in the analysis. The scenario studied in this paper 
is that the analyst seeks to measure the parameters j3, but has no interest in the p.d.f.'s /,-. 

The following test statistic correlates the difference between two p.d.f.'s at different points in 
a multivariate space [Q, OD : 
1 



2 J J {f{x)-Mx)){f{x')-Mx'))Y{\x-^\)dxd^ 

= 11 1 [fi^)fi^) +W)fo{^)-2f{x)Ux')] ^if(\x-^\)dxdx\ (2.2) 

where i//^(|x — /j) is a weighting function. T can be estimated without the need for any knowledge 
about the forms of/ and /o using data sampled from the p.d.f.'s as 

1 « 1 "0 1 'i,no 

where Axjj = \xi — Xj\ and n (no) is the number of events sampled from / (/o). In the order in which 
they appear in Eq. \r% the sums are over pairs of / events, pairs of /o events and pairs consisting 
of an / event and an /o event, respectively. Eq. ^ is simply Eq. ^ rewritten using the fact that 
J f{x)dx = J fo{x)dx = 1, along with the standard Monte Carlo integration approximation. 

It is straightforward to calculate T in this way once a metric is chosen that defines distance in 
the multivariate space (see Ref. [^ for a detailed discussion on metrics; this choice has almost no 
affect on the results). It is worth noting that the larger the difference is between / and /o the larger 
the expectation value of T becomes; thus, T can be used to determine the goodness of fit (g.o.f.) 
of the data to the hypothesis / = /o (for a more detailed discussion, see Refs. [0, 0, ^). From this 
point forwai^d I will follow Ref. [0] and refer to this test as the energy test (the name originates 
from the fact that if y{^) = 1 A then Eq. ^^ is the electrostatic energy of two chai^ge distributions 
of opposite sign; the minimum energy occurs when / = /o). 

I will now extend the method of Ref. [0, 0] to allow for regression using multiple data sets 
sampled from different p.d.f.'s. Consider the following test p.d.f.: 

/o(^,^) = £A/<(^), IA = 1, (2.4) 

i i 

where j3, are unknown real parameters. The T-value that compares / and /o in this case is 
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where n^ is the number of events in the data set sampled from fk{x). In the order in which they 
appear in Eq. ^31 the sums are over pairs of / events, pairs of fk events, pairs consisting of an fu 
event and an // event and pairs consisting of an / and an fk event, respectively. The values of ^u 



that minimize T, j3,t> provide the best g.o.f. In the hmit «,«^ — ;• oo Eq. ^^ approaches Eq. ^^ and 



T = Oii i5k = Pk for each k. It is important to note that Eq. g^ is defined using properly normalized 
p.d.f.'s. If unnormalized functions are used, then the estimators are related to the true values by 
Pk = Pk J fk{x)dx. The uncertainty on the fit parameter values is easily obtained using a resampling 
technique (e.g., bootstrapping). 

One might think that calculating T takes so much CPU time that running this fit is not practi- 
cal; however, a careful inspection of Eq. ^3| reveals that the CPU-intensive components of T, the 
^ Va(Ax) terms, do not depend on the jS^ values. Thus, these terms only need to be calculated once 
(which can be done prior to running the fit). A limitation of this method, however, is that if a p.d.f. 
contributes negative probability to /o, then the fits may be unstable due to fact that it is not possible 
to constrain /o to be non-negative for all x. In many cases it will be easy to tell if this is a problem 
or not. Otherwise, I recommend performing a Monte Carlo study prior to using this method in such 
situations. N.b., in Section ^^ an example is presented where n^ = 3 and one p.d.f. contributes 
negative probability to /o and the method still works properly (in that example, the known rela- 
tionships between the p.d.f.'s makes it easy to determine that the p.d.f. is non-negative everywhere 
even though one term contributes negative probability). 

In the next section I will present two examples to help illustrate how the method works. The 
main goal will be to not only demonstrate how to use the method but to also solidify in the reader's 
mind what the j3^ values represent. I conclude this section by restating the main result of this work: 
estimators for the parameters Pk in Eq. ^Tl] can be obtained by minimizing Eq. ^3| using data sets 
sampled from f^ without the need for any knowledge about the p.d.f.'s themselves. 

3. Example Applications 

The examples I will present in this section are meant to demonstrate how the method works. From a 
physics perspective, these ai^e not the most interesting applications of the method; however, they are 
effective at illustrating how to use the method and how to interpret the meaning of the fit parameters. 
I will first present a very simple univariate example followed by an example using Dalitz plots. The 
weighting function that will be used for the examples below is i//(;c) = — log (x + e), where e is of 
the order of the inverse of the total number of events in all samples combined (e simply guards 
against an infinite contribution due to two extremely close events; the exact value is not important). 
See Section for discussion on the choice of weighting function. 

3.1 Univariate Example 

Consider the case where one has three data sets sampled from the p.d.f.'s fc{x) = I, fi{x) = 2x and 
f{x) = {ax + b)/{a/2 + b) on the interval jc G [0, 1). The sample sizes of these data sets are denoted 
by ric, ni and n, respectively. The p.d.f. / can be rewritten in terms of the other two p.d.f.'s as 

pi + po 
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Figure 1. j3i values obtained for the ensemble of data sets described in Section ^A using the energy test 
for «£. / = Ik (black, dashed) and rid = 10k (black, solid). Results obtained using the binned X^ ^st (red, 
dotted) and maximum likelihood test (blue, dash-dotted) are also shown. 



where j8i = a/2, Po = b and the normalization requirement on the j3 values has been made explicit. 
The values ^\/{^\ + j8o) and j8o/(j8i +j8o) represent the fraction of /'s probability associated with 
// and fc, respectively. E.g., if // and fc represent signal and background p.d.f.'s then the signal 
purity is given by P\/{pi + j8o). The factor of 2 in j8i arises because the function x is not a properly 
normalized p.d.f. on this interval. 

The fit procedure is simple: first, all of the X! wi^ij) terms in Eq. 2.5 need to be calculated. 



This can be time consuming for large data sets but it only needs to be done once. The test p.d.f. is 
then written as 



Mx,p) 



Pi+Po 



(3.2) 



where no knowledge about the forms of // and fc is assumed to be known and the normalization 
requirement on the j8 values has again been made explicit. Calculating T using Eq. ^3| is then 
straightforward for any given values of j3i and j3o. The estimators for j8,- are the values that minimize 
T, A. 

I generated an ensemble of 100 data sets from Eq. ^TT] using the values j8o = j3i = 1/2 (or, 
equivalently, a = l,b = 1/2). I chose n = 500 and two different values for ric and «/: rid = Ik 
and Hcj = 10k. Figure |l] shows the j3i values (j3o is just 1 — j8i) obtained for these data sets using 
the minimum energy, x^ and maximum likelihood tests (statistics are given in Tab. |l]). The x^ and 
maximum likelihood tests were performed by fitting the function fo{x) oc j8i.x + j8o to the data sets 
sampled from /; i.e., the data sets sampled from fc and // were not used and, instead, the functional 
form of / was assumed to be known. Even though I cheated (i.e. , used information I assumed I do 
not actually have access to) using these tests, the performance of the energy test is still comparable. 
The method works: I have determined the fraction of probability in / from fc and // without having 
any knowledge of the forms of fc or // and without trying to approximate fc and // using the data 
(e.g., using a kernel-based method). 

As stated above, the uncertainty on the fit parameters must be obtained using a data resampling 
technique. There are many such techniques available; I chose to use bootstrapping. This technique 
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Table 1. j3i statistics obtained for the ensemble of data sets described in Section "lA using the j^, maximum 
likelihood and energy tests. The data sets were sampled from a p.d.f. with a value of 0.5. 




Figure 2. j3i pull distribution obtained for the ^c,/ = 10k data sets where the parameter uncertainties are 
estimated using the bootstrapping technique. The solid line shows a fit to a gaussian distribution; the fit 
yields ;U =0.08±0.13 and a = 1.16±0.11. 



involves making ?iboot (I chose 100) resampled data sets from each /, /^ and // data set. These 
bootstrap copy data sets are produced by sampling with replacement (thus, the sample sizes are 
unchanged) from the originals. The fit parameter j8i is then computed for each of the ?iboot bootstrap 
data sets; the standard deviation of this distribution is used as an estimate for the uncertainty on 
j3i. Figure g shows the j3i pull distribution obtained for the ensemble of data sets described in the 
previous paragraph. The uncertainty on j3i was determined for each data set in the ensemble using 
bootstrapping. The pull distribution is consistent with the expected standard normal distribution; 
thus, I conclude that the bootstrap method produces reliable estimates for the uncertainty on the jS^t 
values. For more information on bootstrapping, see Ref. [01 . 

3.2 Multivariate Example 

I will now consider a multivariate example involving Dalitz plots. The decay D — ;■ Ks%^%^ is 
allowed for D = D^ ,D^ and any superposition of D° and D^. Experimentally, flavor-tagged samples 
{i.e., pure D" or D°) can be obtained using the decays Z)*+ — ;• D^n^ and D* — ;• D°;r^ since the 
charge of the slow pion tags the flavor of the D. Such data sets with sample sizes of ^(lOOk) have 
been obtained at Belle [Qj. CP eigenstates can be obtained using quantum correlated D^D^ pairs 
produced in the reaction e^e^ — )■ D^D^. CP-tagged data sets have been obtained at CLEO with 
sample sizes of several hundreds of events [0], while sample sizes of i^(lOOO) have been recorded 
(but not yet published) at BES M- 



I will proceed under the assumption of no CP violation in the neutral D system, which is 
known to be valid to a high level of precision. I will denote the quantum mechanical amplitudes 
for the decays D^,D^ ^ Ksiz^ii^ as follows: 

^Lfi^Ksn+n-{x) = ^{^), (3-3) 

^Ifi^Ksn+n-{^) = ^{^)^ (3-4) 

where x = {m\,m^_) and m\ and m^ are the invariant masses of the ^"571+ and KsK^ systems, 
respectively. The flavor-tagged p.d.f.'s are then given by 

f{x) = \s^{x)\^/J^, (3.5) 

/(x) = |i/(x)|Vj^, (3.6) 

where --^ = j \s^{x)\^dx = J\£/{x)\'^dx (which is valid in the absence of CP violation). The am- 
plitudes for the case where the D is tagged to be in a CP eigenstate are given by 

^^(f) = -—[£/{x)±£/{x)]. (3.7) 

The CP-tagged p.d.f.'s are thus 

f±{x) = {\£/{x)\^ + \£/{x)\^ ±2\£/{x)\\£/{x)\cosAe{x)) /y±, (3.8) 
where 



^. = 2r^±/K-p)||...(.-)|cosAepKJ (3.9, 



and A0 (jc) is the phase difference between £/{x) and £/{x) at each point x. I generated an ensemble 
of 100 data sets for each flavor- and CP-tagged decay using the model of Ref. [0] (the amplitudes 
were evaluated using the qf t++ package [@]). Figure shows an example of a data set of each 
tagging type. The sample sizes were chosen to be n = n= 100k and n± = 1000 for the Z)°, D° and 
Dcp± decays, respectively. 

As an example of using the method presented in this paper, consider the case where one has 
collected both flavor-tagged and CP-tagged data sets. I will assume that the only knowledge about 
f{x) and f{x) is the fact that J\£/(x)\^dx = J\£/{x)\-^dx; the functional forms of / and / are 
completely unknown. I will also assume that the CP-odd tagged data set is known to follow Eq. ^^ 
however, only the relative coefficients of \£/{x)\^, \£/{x)\-^ and |i2/(.x)||^(f)| cosA0(f) are known 
(nothing is known about their functional forms). For the CP-even data set, I assume that its p.d.f. 
can be written as 

f+{x,a) oc aQ\£/{x)\^ + ai\s!/{x)\^ + a2\s!/(x)\\£/{x)\cosAd{x). (3.10) 

The goal of the analysis is to test the quality of the CP tagging by determining the a,- values (up 
to an arbitrary normalization factor) using only the available knowledge about the data collected, 
which does not include any knowledge about the functional forms of £/{x) or £/{x). A deviation 
from the CP-even values, a oc (1,1,2), would signify a problem in the CP tagging. 
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Figure 3. Dalitz plots for D -> Ksn+n' for (top left) D", (top right) D°, (bottom left) Dc/'+ and (bottom 
right) Dc/>-. 



The energy test can be used to determine the j8/ values in the following test p.d.f. using the 
same procedure described in the previous section: 

/o(^,i8) = /3o/(^)+i8i/(f) + i82/-(f), £A = 1, (3.11) 

where /, / and /_ are the flavor-tagged and CP-odd-tagged p.d.f.'s, respectively (as defined above). 



Comparing Eq. |3.1l| and Eq. |3.10| one can see that the /3 values are related to the a values by 



« - (j8o + i82 J^M- , j8i + j82^/ J^- , -2jS2^/^- ) . 



(3.12) 



Thus, to determine the coefficients of |£/(x)p, \.s/{x)\'^ and |^(f)||i2/(f)|cosA0(f) for the CP- 
even data set {i.e., to determine f+{x) using the energy test) requires the value of J^^j ^ to be 
known (or obtainable). At first glance one might think that this is a showstopper; however, the ratio 
of these integrals can be obtained from the sample sizes obtained for the various data sets. 

Ref. [[7|] measured both the flavor- and CP-tagged yields using quantum-correlated D^T)^ pairs; 
thus, ^_/j^ can be obtained using 



J _ {n'/np + n'/n^)/! 



(3.13) 




Figure 4. Values obtained for a in Eq. P.IQ using the energy test. The solid lines shows fits of gaussian 
distributions to the data. The resulting means and widths are: jUq = l.OOiO.Ol, Gq = 0.07±0.01; jJ-i = 
0.98 ±0.01,(71 =0.07 ±0.01 and jU2= 2.02 ±0.02, (72 =0.14 ±0.02. 



where no, n^ and ncp- are the total number of D°, D° and CP-odd tagged quantum-correlated 
D^D^ pairs, respectively, and n\h' are the number of flavor-tagged D^ ,D^ — > KsTiii decays ob- 
served at CLEO (not the ones observed at Belle; the Belle data are used in the fits because they 
have larger sample sizes). The sample sizes are no ~ lOOn', n/j ~ lOO/i' and ricp- ~ 100?i-. The 
uncertainties on the various yield ratios in Eq. [3.13| are binomial; thus, the statistical uncertainty on 
J^ / J^^ is negligible due to the fact that the total number of CP-tagged quantum-correlated D^D^ 
pairs is lOOx larger than the CP-tagged D — ^ KsTiTl samples. 



Figure ^ shows the results obtained using the energy test to determine j3 and Eq. |3.12| to convert 
these into a values. The results are in excellent agreement with the true values a = (1, 1,2). The 
relative statistical uncertainty obtained on the a values is 7% (see Section ^ for discussion on 
improving this resolution). Looking at Fig. 0one can imagine that obtaining a model p.d.f. for this 
analysis with a small systematic uncertainty requires a lot of work. Since the goal of this example 
was to simply obtain estimators for a (to test the quality of the CP tagging), there is no reason to 
build such a model. Instead, one can just use the energy test. 



4. Background & Efficiency 

In these examples I have ignored two very important experimental issues: (1) the existence of 
background events and (2) the presence of non-uniform detector efficiency effects. Both of these 
can be accounted for in this method by introducing a weighting factor, w, for each event. The 
weight factor is simply given by w,- = Pg/Pj), where P^ and P^ are the probabilities that the event is 
signal (and not background) and is detected, respectively. Eq. ^3| would then need to be rewritten 
as follows: 

I'd n n nt,ni nj o n,nk 



where W and Wk are the sum of the weight factors for the / and fk data sets, respectively. If w,- = 1 



for all events in all data sets then Eq. L5 is recovered (up to a negligible change, n{n — 1) — )• n^, 
in the same data set sums; this change has been made solely for ease of notation and could easily 
be omitted if desired). N.b., in many cases these weights can be left out. It is up to the analyst to 
decide when these factors are important and to include them when necessary. 



5. Summary & Discussion 

In this paper I have shown that the concept of minimum energy can be used to carry out nonpara- 
metric regression. The parameter values that are obtained give the fractional probability associated 
with each p.d.f. For many analyses, e.g., signal-background subtraction, this is sufficient. To con- 
vert these to coefficients of unnormalized functions requires inputting the ratio of the integrals of 
these functions. These ratios can often times be obtained from measured event yields (an exam- 
ple of which was given in Section ^^ . I have also shown that data resampling methods, e.g., 
bootstrapping, can be used to obtain good estimates of the uncertainties on the parameter values. 

It is well known that in many binned analyses specialized binning schemes can be developed 
to enhance the resolution on the parameters of interest of the x^ test. The same is true for the 
energy test with regards to the weighting function ^lf{x). In this paper I chose a simple generic 
i//^(x) because the goal was to demonstrate how the method works; however, for many analyses 
it will be possible to obtain a better resolution by choosing a ^{x) tailored to the problem being 
analyzed. Refs. [0, 0] discuss several possible choices for i/a(x) and there are undoubtedly many 
other possibilities as well; however, for most analyses (where the p.d.f.'s vary in a smooth and slow 
way) the choice used in this paper is likely to be close to optimal (whatever that may be). 
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